Accuracy, potential, and limitations of probabilistic record linkage in identifying deaths by gender identity and sexual orientation in the state of Rio De Janeiro, Brazil

Background Globally, the counting of deaths based on gender identity and sexual orientation has been a challenge for health systems. In most cases, non-governmental organizations have dedicated themselves to this work. Despite these efforts in generating information, the scarcity of official data presents significant limitations in policy formulation and actions guided by population needs. Therefore, this manuscript aims to evaluate the accuracy, potential, and limits of probabilistic data relationships to yield information on deaths according to gender identity and sexual orientation in the State of Rio de Janeiro. Methods This study evaluated the accuracy of the probabilistic record linkage to obtain information on deaths according to gender and sexual orientation. Data from two information systems were used from June 15, 2015 to December 31, 2020. We constructed nine probabilistic data relationship strategies and identified the performance and cutoff points of the best strategy. Results The best data blocking strategy was established through logical blocks with the first and last names, birthdate, and mother’s name in the pairing strategy. With a population base of 80,178 records, 1556 deaths were retrieved. With an area under the curve of 0.979, this strategy presented 93.26% accuracy, 98.46% sensitivity, and 90.04% specificity for the cutoff point ≥ 17.9 of the data relationship score. The adoption of the cutoff point optimized the manual review phase, identifying 2259 (90.04%) of the 2509 false pairs and identifying 1532 (98.46%) of the 1556 true pairs. Conclusion With the identification of possible strategies for determining probabilistic data relationships, the retrieval of information on mortality according to sexual and gender markers has become feasible. Based on information from the daily routine of health services, the formulation of public policies that consider the LGBTQ + population more closely reflects the reality experienced by these population groups. Supplementary Information The online version contains supplementary material available at 10.1186/s12889-024-19002-x.


Background
Mortality has been considered an important indicator in the context of public health, especially in the area of health surveillance and for public policymakers in Brazil and worldwide.It is not by chance that numerous studies address this topic [1][2][3][4].This is because monitoring of death rates not only helps to identify the patterns and causes of death of population groups but also is useful in defining priorities for the allocation of resources and evaluating the quality of life and well-being of populations, in addition to allowing the measurement of the efficiency of the programs implemented [5,6].Given the importance of monitoring these data, since the 1970s Brazil has implemented one of the most robust information systems for monitoring deaths, the Sistema de Informação sobre Mortalidade (Mortality Information System -SIM).Under the administration of the Brazilian Ministry of Health and fully integrated into the Sistema Único de Saúde (Unified Health System), SIM can publish detailed data on the cause of death using the International Classification of Diseases-10 and several individual and clinical characteristics [7][8][9].
On the other hand, it is still not possible to say that information on deaths truly reaches all population groups.The literature is vast that points out that lesbian, gay, bisexual, travesti, transgender, and queer individuals (LGBTQ+; using the "plus" sign to represent the broad variety of sexual and gender identities) have worse health indicators than others group [10][11][12][13][14][15].However, SIM still does not monitor specific information on sexual orientation and gender identity.Thus, travestis and trans women who were unable (or did not have the desire to) to rectify their names had their death records classified as belonging to the male population.The same occurs for trans men, who are classified as part of the female population.This is because only the sex assigned at birth and recorded in Brazilian administrative systems is identified in SIM [16,17].This is the paradox of the country that, according to nongovernmental agencies, murders the most LGBTQ + people worldwide [18].
Regarding the travesti population, a group which may be unfamiliar to parts of the international community, it is crucial to underscore that this is a demographic group with a gender identity that is unique to Brazil.Travestis express and embody a feminine gender presentation equal to that of cisgender women, despite being assigned a different gender at birth.The travesti community is notable for its long-standing struggle for rights and its resistance against frequent marginalization [15].In recognition of the importance of gender self-identification and the diverse expressions thereof, this manuscript will use the term "travesti" to discuss such identities, comprehending the unique socio-cultural context of their experiences.
In the absence of specific information and due to the urgency of providing data that support specific public policies for these groups, the National Association of Travestis and Transgenders (ANTRA -Associação Nacional de Travestis e Transexuais), as well as gay activist groups, monitors deaths based on newspaper reports and information captured from hospital nets [18].These cases, exclusively focused on murders, represent only a small part of the concrete reality experienced by these people.In other words, the country is very far from being accurate with the mortality profile of LGBTQ + populations, harming everything from the design of public policies to the direct care provided by healthcare professionals (still under the haze of the unknown due to the absence of data).
Since June 15, 2015, information on sexual orientation and gender identity in the Brazilian health information systems, has been systematically collected using only the interpersonal and self-inflicted violence module of the Sistema de Informação de Agravos de Notificação (Notifiable Diseases Information System -SINAN) [19].All suspected or confirmed violent events are recorded, providing a way to expand the dataset on the LGBTQ + population.With possible inaccuracies, (due to the absence of a unique key to identify the population in information systems), the possibility of a probabilistic relationship of data may suggest a tool to estimate deaths within the Brazilian LGBTQ + population.Data relationships are a statistical tool for combining different sets of information: deterministically, when there is a unique key (identifying variable) for linking, or probabilistically when keys are used to estimate the probability that records with different names refer to the same person [20][21][22][23][24][25].As a way of contributing to the systematic integration of the technique in health services, this study aims to evaluate the accuracy, potential, and limits of the probabilistic relationship of data to obtain information on deaths according to gender identity and sexual orientation in the State of Rio de Janeiro.identity and sexual orientation on notification and mortality due to violence: a cohort study." Therefore, this manuscript presents a cross-sectional study that quantifies the accuracy of probabilistic data linkage to offer insights into mortality rates according to gender identity and sexual orientation.The analysis incorporates data from individuals registered in SINAN for interpersonal and self-inflicted violence, as well as data from the SIM of 92 municipalities in the state of Rio de Janeiro, Brazil, for the period from June 15, 2015, to December 31, 2020.

Population and selection criteria
To determine the probabilistic relationships between the databases, the SINAN records and occurrence of interpersonal/self-inflicted violence between June 15, 2015 and December 31, 2020.The beginning of this period (June 15, 2015) was established based on the introduction date of collecting the variables 'sexual orientation' and 'gender identity' in the SINAN.
The records of adults aged 19 to 59 years were included.Records without information or with invalid data on birthdate, name, mother's name, and date of occurrence of violent acts were excluded.For SIM qualification, records without identification or with invalid names or mothers' names (e.g., indigent, male, black male, white female, etc.) were excluded.To reduce the number of records and accelerate the data relationship process, records of people younger than 19 years of age were excluded, as this was the criterion for selection in SINAN.

Linkage key variables
The variables used as link keys between the databases were "name", "mother's name", and "birthdate" from the SINAN dataset.Additionally, the variables "notification number", "date of notification", and "date of occurrence" were sourced from SINAN, while "death number", and "date of death" were obtained from the SIM dataset.These variables were utilized to construct an identifier code and to verify duplicate records.

Variables of interest
The main variable under investigation was death, which was identified and retrieved from the SIM through a linkage process between databases.All deaths recorded during the study period (from June 15, 2015, to December 31, 2020), regardless of cause, were included in this analysis.
Variables such as sex, sexual orientation, and gender identity, which were of interest in this investigation, were synthesized to build analytical groups approximating the realities of exposure of people to the concrete reality.Originally present in SINAN, the possible entries included female (S0) and male (S1) for the variable "sex"; heterosexual (O0), bisexual or homosexual (O1), and "unknown" or "omitted" (O2) for "sexual orientation"; and, for the gender variable, it is important to consider that the registration was made only for "travestis" and "transgender women" (G1) and "transgender men" (G2), with the interpretation that data "unknown" or "omitted" (G0) referred to the cisgender population, despite the validity problems that this may produce [17].
Thus, eight groups were created, as demonstrated by Fig. 1.The participants were cisgender and heterosexual women (S0 + O0 + G0), women who had sex with women (S0 + O1 + G0), cisgender women of unknown sexual orientation (S0 + O2 + G0), travestis and transgender women (G1), cisgender and heterosexual men (S1 + O0 + G0), men who had sex with men (S1 + O1 + G0), cisgender men with unknown sexual orientation (S1 + O2 + G0), and transgender men (G2).Due to the limited number of cases retrieved from the databases, the sexual orientations of transgender people were not combined in the groups formed, limiting the definition of interest groups.

Population characteristics variables
Drawing from the SINAN database, variables were selected to characterize the investigated population.The "age group" was defined by categorizing the continuous numerical variable "age" into four adult strata.The other variables used to characterize the population included "marital status", "schooling", and "self-declared race/ color".The variables "person with a disability" and "people with mental or behavioral disorders" were considered as "yes" if at least one disability, and mental or behavioral disorder respectively, were documented in the database.Similarly, the "referral to the care network" was marked as "yes" if there was at least one referral to social, legal, or healthcare services documented.The "chronicity of violent episodes" variable was constructed based on the repeated entries in the SINAN database, as explained in the section on probabilistic data linkage.

Routines of the probabilistic relationship of data Stage 1: preprocessing
The first stage, titled preprocessing, was intended for the preparation and standardization of the databases for the probabilistic relationship of the data [21].In this stage, computational scripts were built in Python 3.11.2software in the Visual Studio Code 1.84.2 environment.Initially, the databases were reduced, leaving only the variables of interest for linking the databases (notification number/death number, date of notification, date of occurrence/date of death, name, mother's name, and birthdate) to facilitate the process.The study selection criteria were also applied, and unique keys were constructed for each record in the database.This construction was necessary because, at least in SINAN, the notification number, which should be unique for each record, shares the same number with other people.In other words, different people had the same notification number without being homonymous.Thus, the unique key was generated based on the combination of the notification (or death) number, birthdate, first and last name, and mother's name of each person.
With the expectation of reducing potential spelling errors in name records, a set of standardizations was implemented.This stage of standardizing letters and dates serves as a critical phase in the preprocessing for probabilistic linkage.Specifically, in the case of names, variations in spelling and the presence of special characters can increase discrepancies between records across two databases even when they pertain to the same individual.Consequently, standardizing records, particularly in personal identification, enhances the likelihood of matching between two distinct data sources [21,23,25].
To handle last names in the preprocessing phase, the names were separated into fragments (first name, middle name, and last name) to optimize the next steps of building data relationships.Dates were standardized to write the day, month, and year as two, two, and four numerals, without separation by slashes.

Stage 2: initial deduplication
The following steps, which included the completion of database preparation and the steps related to the probabilistic relationship of the data itself, were conducted in Link Plus version 3.0.This software, developed by the USA's Centers for Disease Control and Prevention, was initially designed for the probabilistic relationship of cancer registries in the USA [26].However, in the field of public health, its use has spread in other contexts and countries, including Brazil [22,25,[27][28][29].
Thus, to finalize the preparation of the databases, the deduplication technique was applied due to the possibility of multiple records in SINAN since the recurrence of different violent events throughout life and multiple notifications of the same events are quite likely.Events may be recorded by more than one professional or health unit.The deduplication technique consists of verifying these repeated records in a database [21,23].Cases of duplicity were treated according to the following rules: (1) When the occurrence number was the same, with the same notification and occurrence dates, victim's name, mother's name, and birthdate, even allowing for spelling variations indicative of error, the record was considered the same, so only one such record was selected.(2) When the date of occurrence was different but the name, mother's name, and birthdate were the same, even with the presence of spelling changes, it was considered a recurrence of the event, which we treated as a recurrent/chronic case.In this second rule, only the oldest record was selected, and we counted the number of repeated events in a new variable named "chronicity of violent episodes".

Stage 3: blocking and matching
After standardization and preprocessing were completed, the blocking and pairing stage began.Blocking is a method of creating logical subsets based on specific criteria of variables to reduce the number of comparisons during the database relationship.This reduces the processing time.Pairing is intended to compare the corresponding records in the databases using preestablished algorithms in the relationship software [21,23].
In the specific case of this study, nine blocking and pairing strategies were compared.The blocking strategies combined the first name (FN), last name (LN), mother's first name (FM), mother's last name (LM), and birthdate (BD) using the soundex code to reduce potential spelling errors and nominal variations.For the pairing strategies, combinations of the full name (N), mother's name (M), and birthdate (BD) were used, defining the minimum probabilities of agreement (M-probability) as 0.95 and 0.65 for the name and birthdate, respectively.In addition, the direct method of pairing and an initial cutoff point of 10 were used for the scores calculated from the data.

Stage 4: manual review of pairs
A manual peer review was performed for each strategy.The defining criteria for true pairs were as follows: (1) same name, birthdate, and mother's name; (2) same name, mother's name, and day and month of birth but with a variation of ± 2 years for the last digit of the year of birth; (3) same name, mother's name, and year of birth, with inversion of the day and month digits; (4) same name, mother's name, and day and year of birth; (5) same name, mother's name, and month and year of birth; (6) similar name and mother's name with variation in the orthographic scan; and (7) unusual name and mother's name (foreign names, for example) that agree on the date of birth.

Stage 5: Postpairing
In the postpairing phase, a new deduplication was applied to the sets of pairs formed in each strategy.The selection of the best matching and blocking strategy was based on the shortest processing time and the most true pairs after deduplication.Once the best strategy was found, the files were combined based on the unique key constructed during the preparation phase, thus joining the true pairs (deaths) with the SINAN database and all its variables.The same occurred with the SIM variables, specifically concerning true pairs.The bank combinations were performed using Stata SE 15 software.

Statistical analysis
We analyzed the processing times, minimum and maximum scores, absolute numbers, and proportions of true pairs formed by each blocking and pairing strategy in the probabilistic data relationship.We determined the criteria that composed the optimal strategy.We computed the areas under the receiver operating characteristic (ROC) curves to determine the best cutoff point for the matching scores.In addition, the sensitivity (%), specificity (%), accuracy (%), positive and negative likelihood ratios, number of true pairs, and number of false pairs were calculated for each cutoff point of the scores of the optimal strategy.
After integrating the SINAN database with the true pairs data from the SIM, and determining the optimal matching strategy, we calculated the proportions for each variable of interest in the study.This process also involved determining the number of deaths from any cause within this specific group and calculating the mortality rates per thousand individuals.These rates were derived by comparing the total number of deaths identified through the linkage of the SIM and SINAN databases against the records of violence victims listed in SINAN.This approach has allowed us to gauge the mortality among violence victims recorded in SINAN, without intending to establish a direct causality between the violent incidents occurring over time and death.
We then analyzed these rates to identify differences between the initial cut-off point and the one chosen through the most effective matching strategy.In addition, with the objective of observing the existence of differences between the cutoff points, the proportion of agreement, the kappa statistic, and the respective p values were calculated for each covariate of the study.The interpretation of the kappa statistic was classified as full agreement (0.81-1.00), substantial agreement (0.61-0.80), moderate agreement (0.41-0.60), fair agreement (0.21-0.40), slight agreement (0-0.20), or no agreement (< 0) [30,31].All analyses were performed using Stata SE 15 software.

Ethical considerations
Our study was conducted following Brazilian ethical guidelines for research involving human subjects.The State University of Rio de Janeiro Research Ethics Committee (Rio de Janeiro, Brazil) waived the need for informed consent, and reviewed and approved the study under protocol number 5,009,244.

Data preparation and matching strategy
Figure 2 shows the sample composition that formed the baseline of the SINAN probabilistic relationship of interpersonal and self-inflicted violence, as well as the SIM rating.Note that 60.35% (n = 128,620) of the records were excluded from SINAN based on the selection criteria.In addition, 3639 cases of notification recurrence were identified-i.e., when the case was the same but with different events and dates.In this case, only one record was kept, for which purpose we created a specific variable, which represented the chronicity of violent episodes.Thus, the baseline consisted of 80,178 records.In turn, SIM had 90,279 (10.12%) exclusions of records based on the qualification criteria of the database, totaling 737,493 death records (Fig. 1).
Table 1 presents the results of the probabilistic relationship according to the 9 blocking and pairing strategies.strategy 1, although not the one with the highest percentage of true positives after deduplication, had the  highest absolute number of true pairs (n = 1156) and the shortest data processing time (11 min and 50 s).Therefore, according to the preestablished criteria, the best strategy for further analysis was strategy 1.

Performance of the best matching strategy
Figure 3 shows the ROC curve of strategy 1.With an area under the curve of 0.979 (95% confidence interval: 0.976-0.983),the probabilistic relationship has an excellent ability to discriminate between true and false pairs.Table 2  lists the properties of the strategy.When a cutoff point of 17.9 was used for the score, the sensitivity was 98.46%, the specificity was 90.04%, and the positive and negative likelihood ratios were 9.84 and 0.02, respectively.This means this cutoff point allowed the correct identification of 98.46% of the true pairs and 93.26% of the false pairs.In addition, the likelihood ratio values show that a positive rating in the relationship significantly increased the chances of the pair being truly positive, while a negative classification, due to the very low negative likelihood ratio, provided strong evidence that the pair was not a true pair.Finally, when analyzing the accuracy, we observed that this cutoff point could correctly classify 93.26% of the observations, i.e., 1532 (98.46%) of the 1556 true pairs, as well as correctly classify 2259 (90.04%) of the 2509 false pairs, greatly reducing the need for manual review after the probabilistic relationship of the data.

Characteristics of population, mortality rates, and linkage cutoff agreement
Table 3 presents the characteristics of the study population; the crude mortality rates for the two cutoff points of the scores generated in the probabilistic relationship of the data; and the kappa statistic for each population subgroup.The study population was mostly composed of women (76.72%), whether cisgender or transgender, aged up to 39 years (73.74%),black (65.46%), single (45.56%), and having more than 8 years of school (66.23%).A total of 2.31% of people with disabilities were identified, 4.77% of the population reported some mental or behavioral disorders, and 3.77% had chronic episodes of violence.Most patients (78.12%) had been referred to the assistance network.Compared with women, mortality rates were ~ 3 times greater in men.Notably, after excluding groups of women with an unknown sexual orientation, the women who had sex with women and travestis and transgender women had higher mortality rates than did cisgender women.Higher mortality rates are also observed with advancing age; in black, married and nonsingle people; with less schooling; in people with disabilities or mental and behavioral disorders; and in cases of chronic violence.In contrast, people who received referrals to the health care network had lower mortality rates.With the kappa statistic indicating full agreement for all covariates (p < 0.001), the adoption of the cutoff point of ≥ 17.90 did not produce substantial differences in the interpretation of the results.

Discussion
The main results of this study are focused on demonstrating the feasibility and good accuracy of determining the probabilistic relationship of data between the SINAN database of interpersonal and self-inflicted violence and the SIM database to obtain information on sexual orientation and gender identity.Keeping a certain pioneering spirit, our findings make a significant contribution to addressing critical gaps on this subject both in Brazil and globally.As noted, the utilization of data linkage techniques allows for the retrieval of information about gender and sexuality markers, a facet still relatively underexplored in research.Determining a linkage strategy, with parsimonious parameters for identifying true and false pairs, tends to contribute to the use of the technique in the daily routine of health services and in academic production.By adopting preprocessing routines and considering the relationship time as a function of the number of true pairs detected, the recognition of the best blocking and processing strategies constitutes an advance for the systematic adoption of this method in the daily life of the services, especially when considering sexual orientation and gender identity as exposure variables in future analyses.These procedures resulted in a true-pair detection rate with proportions compatible with those found in other investigations whose outcome variable was the registration of death in SIM [20,32].
On the other hand, it is important to note that we are faced with slightly different connecting keys between these investigations.In this article, the linking keys were restricted to the variable's "name", "birthdate", and "mother's name".Although the municipality of residence is traditionally used in probabilistic relationships, in the specific case of our study, this variable could bring more elements of uncertainty than solutions.This is because when reaching groups such as travestis and transgender women, as well as people known to be victims of violence, depending on the population base used, it is imperative to consider the possibility of constant changes in addresses over time due to stigma and the structure of violence itself that surrounds them [33].There has been a consensus in Brazilian production that housing instability is a factor present in the lives of a significant portion of travestis and transgender women [10,34].These distinctions in the probabilistic data matching strategies necessitate caution when comparing the data linkage results of different studies.
Complementing this analysis, the definition of a cutoff point in the scores of the probabilistic relationship of data constitutes another important aspect for the advancement of the adoption of this technique in analyses both by health surveillance services and by the academic sector.Recognizing the cutoff point, the manual peer review phase tends to be more efficient [4,24], significantly reducing processing time and improving accuracy in future data relationships.The adoption of scores ≥ 17.9 reduced 56.16% of the pairs to be reviewed, with 90.04% of the pairs being truly negative.In other words, with high specificity (> 90.00%) and with the loss of only 25 true pairs erroneously classified as false (98.46% sensitivity), our strategy 1 demonstrated excellent properties and elements compatible with other linkage strategies with Brazilian datasets [20,32].In addition, when incorporating the deaths into the SINAN database, the degrees of agreement for the kappa indicator are high and close to 1 for all the variables of interest in this study, revealing that the adoption of a higher cutoff point (≥ 17.9) does not imply losses for future analyses.The mortality rates observed at the two cutoff points established by the linkage strategy revealed a scenario consistent with the investigations on the subject.That is, advanced age, black race/color, a shorter length of schooling, and disabilities and mental and behavioral disorders [35][36][37] were associated with higher mortality rates.Thus, these results corroborate the effectiveness of the probabilistic data relationship strategy in the production of reliable information on this topic.The same occurs in relation to gender markers and sexual orientation.It was observed that transgender women and those who have sex with other women have higher mortality rates than heterosexual cisgender women.Men, whether they are cisgender or transgender, as well as men who have sex with men, also have a higher risk of dying than heterosexual cisgender women [38][39][40][41].Thus, having established the cutoff point and the best strategies for capturing information in large datasets, it is essential to scrutinize in more detail how sexual and gender markers affect the risk of dying in the population.
Furthermore, the number of deaths among travestis and transgender people is likely higher than that captured in the linkage process, resulting in classification bias.While cisgender men and women have their names established passively, that is, during birth [42], transgender and nonbinary people are actively involved in the choice of this name and still face important bureaucratic barriers to be able to make it compatible with their identity [16].It's conceivable that the name recorded in the SINAN form for some transgender individuals may have been changed over time [19].It's plausible to consider that the deaths of some transgender individuals may not have been accurately captured during the linkage process.In these cases, the death will not be attributed to the person, resulting in a false-negative result and, once again, approaching the mortality rates between cisgender and transgender people.
Additionally, although the use of the method is promising for the analysis of mortality according to sexual orientation and gender identity, other precautions are necessary.Although the quality of filling out the SINAN has improved over the years [43], notably, the present study found that sexual orientation was not reported (i.e., it was considered "unknown" or "omitted") in 44.45% of the patients.A similar situation can be observed regarding "self-reported race/color", "marital status", and, even more markedly, regarding the variable "schooling".
The quality and completeness of information in Brazilian health information systems have been constant and long-standing challenges for researchers and health managers since the omission of such data greatly compromises the analytical capacity and formulation of public policies [44][45][46].Because they are constant variables and are known to be associated with asymmetries and inequities in health [47][48][49][50][51], their omission in analytical models, either as a confounding variable or, in some cases, as an exposure variable, can produce strong biases in the production of knowledge on the subject.In this vein, reflecting on professional awareness strategies for the incorporation of these practices of sensitive and attentive questioning to users, as well as awareness that these factors are preponderant for the construction of health actions, are currently urgent elements.
The classification of the "gender identity" variable in SINAN demands careful attention as the system prescribes specific categories for travestis, transgender women, and transgender men [19].Consequently, individuals who do not fit these specified categories or whose identities were not recorded -labeled in the system as "unknown" or "omitted" -are automatically classified as cisgender.This oversimplified approach might not accurately capture the full spectrum of gender identities, particularly for those whose true identity was either not recorded or not inquired about.This limitation can compromise the tool's sensitivity and, in turn, in mortality analyses, data about the cisgender population may be skewed by inaccurately categorized deaths.This leads to a potentially flawed estimation of mortality rates among cisgender individuals compared to the groups of travestis, transgender women, and transgender men [17].
In addition to the limits already exposed and which must be considered when interpreting probabilistic relationships with these datasets, an important selection bias must be considered in future analyses.When studying one of the few databases that contains information on sexual orientation and gender identity, the SINAN database of interpersonal and self-inflicted violence, we partially apprehended the actual information experienced by the population.This is because not all people suffer violence, and not all people who suffer violence are notified in health services.
Studies have shown substantial differences in the prevalence of violence between primary surveys and the results obtained by SINAN [52][53][54].A notable example of these differences is the fact that the surveys indicate psychological violence as the most prevalent [55], while the SINAN data indicate physical violence as the one with the greatest magnitude of violent events [43].This difference may be associated with health professionals' own understanding of what the phenomenon of violence is, as well as the interpretation of which violence is worthy of reporting and which is not, since notions about this phenomenon are influenced by the sociohistorical conformations of society [56].In other words, culture, values, and social structure directly affect people's perceptions of what is and is not considered violence.They also produce a kind of stratification of which types of violence are more violent than others and, consequently, those that are worthy of immediate reporting and those that do not need to be recorded urgently [57].Thus, physical and sexual violence, as well as cases that imminently threaten life, might be more common than others.
This information bias results in the generation of statistics based on SINAN data that differ from the reality estimated by surveys [43,[52][53][54][55].In this context, analyses derived from linkages whose population base is based on SINAN should be interpreted with caution, as the records refer to a population subjected to specific contexts that may be different from those observed in the general population.Thus, the mortality rates observed in our study should not be taken as representative of the general population.On the contrary, their utility lies in the ability to conduct repeated measures over time, providing the monitoring of mortality according to sexual orientation and gender identity.As mentioned, the SINAN is the only Brazilian database where sexual and gender markers are officially monitored.
This bias can be circumvented by introducing a unique identification key [16] in information systems, such as civil registry numbers, the natural persons registry, or the national registry of the health system, as well as the introduction of the variables "sexual orientation" and "gender identity" into other data collection systems and not only in the SINAN database of interpersonal and self-inflicted violence [16,58].Until then, even with recognized inaccuracies and analytical limits, the performance of procedures involving probabilistic data relationships should be encouraged to better understand the needs of certain subpopulations, improve the technique for capturing information about the LGBTQ + population, and thereby advance the formulation of public policies compatible with the needs of this group.

Conclusion
Despite the inaccuracies produced by the data collection format of the variables "gender identity" and "sexual orientation", the probabilistic data linkage is an important technique with high applicability in the daily routine of health services.With its identification of possible blocking and pairing strategies and the detection of the best cutoff point for linkage scores, this technique becomes more useful in daily research and monitoring by health surveillance services.The application of this technique should be considered in monitoring mortalities over time, as it is useful in formulating public policies for the LGBTQ + population.

Fig. 1
Fig. 1 Flowchart of study's interest groups based on sex, gender identity, and sexual orientation variables from SINAN.State of Rio de Janeiro, Brazil, 2015-2020

Fig. 2
Fig. 2 Flowchart of the study population composition for probabilistic record linkage.State of Rio de Janeiro, Brazil, 2015-2020

Table 1
The results of probabilistic record linkage according to blocking and matching strategies.State of Rio de Janeiro, Brazil, 2015-2020 LegendID: identification of the data linkage strategy; P1: pairs formed after data linkage; TP1: proportion of true pairs formed after data linkage; DP: number of duplicate pairs; P2: pairs formed after the deduplication strategy; TP2: true pairs formed after deduplication; %TP2: proportion of true pairs formed after deduplication; DB: birthdate; N: name; FN: soundex of the first name; LN: soundex of the last surname; M: mother's name; FM: soundex of the mother's first name; LM: soundex of the mother's last surname

Table 2
Accuracy by cutoff point of the best blocking and matching strategy (strategy 1).State of Rio de Janeiro, Brazil, 2015-2020 (n = 4065)

Table 3
Characteristics of the study population, mortality rates (per 1000 persons), and agreement among the cutoff points of the best strategy (#1) for each population subgroup.State of Rio de Janeiro, Brazil, 2015-2020 (n = 80.178)